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^ , Abstract 

^: 

•^ . A C(j implementation of a generalized k-means variant called relational k-means 

\f^ ! is described here. Relational k-means is a generalization of the well-known k-means 

clustering method which works for non-Euclidean scenarios as well. The input is an 

arbitrary distance matrix, as opposed to the traditional k-means method, where the 

C^ | clustered objects need to be identified with vectors. 
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Suppose that we are given a set of data points p%, ...p n G M. d and a number of clusters 



N. We would like assort these data points "optimally" into iV clusters. For each point Pi 
let Zi denote the center of gravity of the cluster p^ is assigned to. We call the z^ vectors 
centroids. The standard k-means method |2J attempts to produce such a clustering of the 
data points that the sum of squared centroid distances Yl7=i \\Pi~ z i\\ 2 ^ s minimized. 

The main difficulty of this method is that it requires the data points to be the elements 
of a Euclidean space, since we need to average the data points somehow. A generalization 
of k-means called relational k-means has been proposed [3j to address this issue. This 
generalized k-means variant does not require the data points to be vectors. 

Instead, the data points p±, ...p n can form an abstract set S with a completely arbitrary 
distance function / : S x S — > [0, oo). We only require that / is symmetrical, and 
f(pi,Pi) = for all pi G S. Note that / need not even satisfy the triangle inequality. 



2 Relational k-means 

First, we provide a brief outline of the algorithm. Let A G M. nxn be the squared distance 
matrix. That is, Ay = f{pi,Pj) 2 - The algorithm starts with some initial clustering and 
improves it by repeatedly performing an iteration step. The algorithm stops if the last 



iteration did not decrease the value of the clustering (defined below). Of course, if the 
iteration increased the value of the clustering, the algorithm reverts the last iteration. 

Now we describe the algorithm in detail. At any time during execution, let S\, ...Sn C 
S denote the clusters. For each data point pi, let £(pi) denote the index of the cluster pi 
is assigned to. Let ei £ W 1 denote the ith standard basis vector, and, for i £ {1, ...n} and 
j £ {1, ...N} let Vij := |Ji^ fceS efc — e,. Let us call the quantity q^ := —^vJjAvij the 
squared centroid distance corresponding to the point pi and the cluster Sj. 

In |3] it is observed that, if the distance function is derived from a Euclidean represen- 
tation of the data points, then qv, equals to the squared distance of p^ and the center of 
gravity of Sj. Thus qij is indeed an extension of the classical notion of squared centroid 
distances. 

Define the value of a clustering as Y^7=i Qm- We say that a clustering is better than 
another one if and only if its value is less than the value of the other clustering. 

The relational k-means algorithm takes an initial clustering (e.g. a random one), and 
improves it through repeated applications of an iteration step which reclusters the data 
points. The iteration step simply reassigns each data point to the cluster which minimized 
the squared centroid distance in the previous clustering. If the value of clustering does not 
decrease through the reassignment, then the reassignment is undone and the algorithm 
stops. 

In the non-Euclidean case there might be scenarios when an iteration actually wors- 
ens the clustering. Should this peculiar behavior be undesirable, it can be avoided by 
"stretching" the distance matrix and thus making it Euclidean. 

Stretching means replacing A with A + /3(J — I), where J is the matrix whose entries 
are all l's, / is the identity matrix, and j3 > is the smallest real number for which 
A + j3{J — I) is a Euclidean squared distance matrix, i.e. it equals to the squared distance 
matrix of some n vectors. It can be easily deducted that such a j3 exists. This method is 
called /3-spread transformation (see [lj). 

The algorithm is thus as follows: 

• Start with some clustering, e.g. a random one 

• Calculate the value of the current clustering and store it in V\ 

• For each pi data point, calculate and store the squared centroid distances qn, ...gjjv 

• Reassign each p^ data point to the cluster that yielded the smallest squared centroid 
distance in the previous step 

• Calculate the value of the current clustering and store it in V 2 

• If V 2 > Vi, then undo the previous reassignment and stop 

• Go to line number 2 




3 Time complexity 

The algorithm is clearly finite because it gradually decreases the value of the current 
clustering and the number of different clusterings is finite. Each iteration step can easily 
be implemented in 0(n 3 ) time: for each data point, we need to calculate N quadratic 
forms, which can be done in n^2- =1 \Sj\ 2 < n 3 time. This is unfortunately too slow for 
practical applications. 

However, this can be improved down to 0(n 2 ). The squared centroid distances can 
be transformed as follows (using An = 0): 



** = "o I — 1^ e * - e * I A ! — Y, e * ~ e * I = ~^l2 S Aab + To] J2 Aik - 

Here the first summand is independent of i and thus needs to be calculated for each 
cluster only once per iteration. On the other hand, the second summand can be calculated 
in C ( | S'j | ) time. To sum up, the amount of arithmetic operations per iteration is at most 
constant times X^'=i \^j\ 2 + n ^j=i \^j\ — ^n 2 - 

The current implementation makes several attempts to find a better clustering. In 
each attempt, the full relational k-means algorithm is run, starting from a new random 
clustering. Every attempt has the possibility to produce a clustering which is better than 
the best one among the previous attempts. If the sofar best clustering has not been 
improved in the last K attempts (where K is a parameter), then it is assumed that the 
clustering which is currently the best is not too far from the optimum, and the program 
execution stops. 

Attempts do not depend on each other's result and do not modify shared resources 
(apart from a shared random generator). Our implementation uses Parallel. For for 
launching multiple attempts at once. The number of parallel threads can be customized 
via a command-line switch, and by default equals to the number of logical processors. 
This results in near 100% CPU utilization. Launching less than C threads allows leaving 
some CPU time for other processes. 



4 Test results and conclusion 

We implemented the above algorithm in C% and run the program on a set of >1000 
proteins with a Levenshtein-like distance function. The value of K (maximum allowed 
number of failed attempts, i.e. the "bad luck streak") was 20, and the value of iV (number 
of clusters) was 10. The testing was done on an average dual core laptop computer and 
finished in 30. .60 seconds. This proves that relational k-means can be implemented in a 
way efficient enough to be applied to real-world datasets. 

Since the program is almost fully parallelized, we could expect it to finish in <8 seconds 
for the same dataset on a 16-core machine. Note that the total runtime is proportional 



to the number of attempts made, which is highly variable due to the random nature of 
the algorithm. 

A C++ implementation could further reduce execution time. According to our esti- 
mate, it could make the program cca. twice as fast. 



5 Attachments 

5.1 Program source code in C% 

//Generalized k-means 

//Copyright (C) 2013 Balazs Szalkai 

//If you use this program in your research, please cite the following article: 

//B . Szalkai : An implementation of the relational k-means algorithm. ArXiv e -print s , 2013 . 

//This program is free software: you can redistribute it and/or modify 
//it under the terms of the GNU General Public License as published by 
//the Free Software Foundation, either version 3 of the License, or 
//(at your option) any later version. 

//This program is distributed in the hope that it will be useful, 
//but WITHOUT ANY WARRANTY; without even the implied warranty of 
//MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 
//GNU General Public License for more details. 

//You should have received a copy of the GNU General Public License 
//along with this program. If not, see <http://www.gnu.org/licenses/>. 

using System; 

using System. Collect ions . Generic ; 

using System. Globalization; 

using System. 10; 

using System. Reflection; 

using System. Threading; 

using System. Threading. Tasks; 

namespace GeneralizedKMeans 

■c 

static class Utils 
{ 

publ i c st at i c double Sqr (double x ) 

■c 

return x * x; 
} 
} 

class DistanceMatrix 
i 

public int Count ; 

public string [] Name; 

public double [,] SqrDi stance; 

public DistanceMatrix () 

-c 

Resize (0) ; 
} 

public void Resize(int NewCount) 

■c 

Count = NewCount ; 
Name = new string [Count] ; 
SqrDistance = new double [Count , Count]; 
> 

public void Load(string fn) 
■C 

string [] lines = File.ReadAllLines (fn) ; 

Resize (Array . IndexOf (lines , "//")); 

for (int i - 0; i < Count; i++) 
{ 

Name [i] = lines [i] .Trim ; 

string[] distLine - lines [Count + 1 + i] . Split( ' ; ') ; 

for (int j = 0; j < Count; j++) 

SqrDistance [i, j] = Utils .Sqr (double. Parse (distLine [j] . TrimO , Culture Info. Invar iantCulture) ) ; 
} 
> 
} 

class CentroidDi stance 

public Clusters Clusters ; 



double [] AdditiveConstant ; 

public CentroidDi stance (Clusters clusters) 
{ 

Clusters = clusters; 
} 

// Call this before using the other members functions 

public void Initialize () 

{ 

// calculate additive constants for clusters 

AdditiveConstant = new double [Clusters .Count] ; 

for (int cluster = 0; cluster < Clusters .Count; cluster++) 

var objects = Clusters. ClusterToObjects [cluster] ; 

double x = 0; 

foreach (var i in objects) 
foreach (var j in objects) 

x += Clusters .Matrix. SqrDistance [i, j] ; 
AdditiveConstant [cluster] = -x / (2 * Utils . Sqr( (double) objects. Count)) ; 
> 
} 

// Calculates a squared centroid distance 

public double CalculateSqrDist(int obj , int cluster) 

■c 

var objects = Clusters .ClusterToObjects [cluster] ; 

double x = 0; 

foreach (int j in objects) x += Clusters .Matrix. SqrDistance[obj , j] ; 

double dist - AdditiveConstant [cluster] + x / objects .Count; 

return dist; 
} 

public int GetNearestCluster (int obj ) 

■c 

int nearestCluster = -1 ; 

double minSqrDist = 0; 

for (int cluster - 0; cluster < Clusters .Count; cluster++) 

{ 

double sqrDist - CalculateSqrDist (obj , cluster) ; 
if (nearestCluster < I I sqrDist < minSqrDist) 
{ 

minSqrDist = sqrDist ; 
nearestCluster = cluster; 
> 
> 

return nearestCluster; 
} 

public double GetClusteringValueO 

{ 

double result = 0; 

for (int i = 0; i < Clusters. Matrix. Count ; i++) 

result += CalculateSqrDist (i, Clusters .ObjectToCluster [i] ) ; 
return result; 
} 
} 

class Clusters 
{ 

public DistanceMatrix Matrix; 

publ i c int Count ; 

public int [] ObjectToCluster; 
public List<int> [] ClusterToObjects; 

private CentroidDi stance cd; 

public Clusters (DistanceMatrix matrix , int count) 
i 

Matrix = matrix; 

Count = count ; 

ObjectToCluster - new int [Matrix. Count] ; 
ClusterToObjects = new List<int> [Count] ; 
for (int i - 0; i < Count; i++) ClusterToObjects [i] = new List<int>(); 

cd = new CentroidDistance (this) ; 

Randomize () ; 
} 

private static Random Rand - new Random () ; 

public void Randomize () 
{ 

for (int i - 0; i < Count; i++) ClusterToObjects [i] .Clear () ; 

lock (Rand) 

{ 



for (int i = 0; i < Matrix. Count ; i++) 
{ 

int cluster = Rand. Next (Count) ; 
ObjectToCluster [i] - cluster; 
ClusterToObjects [cluster] .Add(i) ; 
> 
} 
} 

private void Make_ClusterToQbjects() 

■c 

for (int cluster = 0; cluster < Count; cluster++) 

ClusterToObjects [cluster] . Clear O ; 
for (int i = 0; i < Matrix . Count ; i++) 

ClusterToObjects[ObjectToCluster[i]] .Add(i) ; 
} 

// Returns how many objects have changed cluster 
publ i c int 1 1 er at e ( ) 

{ 

int changed = 0; 
cd. Initialize () ; 

// find the nearest centroid for the objects 

int [] New_ObjectToCluster - new int [Matrix. Count] ; 

for (int i = 0; i < Matrix . Count ; i++) 

■C 

New^0bjectToCluster[i] = cd.GetNearestCluster(i) ; 

if (New_ObjectToCluster[i] != ObjectToCluster [i] ) changed++; 
> 

// update the configuration 

double oldValue - Get Value (); 

int[] 01d_0bjectToCluster = ObjectToCluster; 

ObjectToCluster - New_ObjectToCluster; 

Make„ClusterToObjects() ; 

double newValue = Get Value (); 

// clustering got worse (this is possible with some distance matrices) or stayed the same? 

if (oldValue <= newValue) 

{ 

// undo iteration 

ObjectToCluster = 01d_0bjectToCluster ; 

Make_ClusterToObjects() ; 

return 0; 
> 

return changed; 
} 

public double Get Value 

■c 

cd. Initialize () ; 
return cd.GetClusteringValueO ; 
> 

public void Dump(TextWriter w) 

■c 

w.Wr it eLine (Matrix. Count + " ; objects") ; 
w. Wr it eLine (Count + " ; clusters") ; 
w.WriteLine(GetValue() + " ; value") ; 
for (int i = 0; i < Matrix . Count ; i++) 

w.WriteLine(ObjectToCluster[i] + ";<-;" + Matrix. Name [i] ) ; 
for (int cluster - 0; cluster < Count; cluster++) 
{ 

w.Write(cluster+" ;->"); 

foreach (var i in ClusterToObjects [cluster] ) w. Write (" ; " + Matrix. Name [i] ) ; 

w.WriteLineO ; 
} 
} 
} 

class Program 
{. 

const int maxBadLuckStreak_Def ault = 100 ; 

void Run(Dictionary<string,string> args) 
i 

string inputFile - args["i"]; 

int nClusters = int .Parse (args ["n"] ) ; 

string outputFile - args .ContainsKeyO'o") ? args["o"] : null; 

int maxBadLuckStreak * args .ContainsKeyC'm") ? int .Parse (args ["m"] ) : maxBadLuckStreak_Default; 

int nThreads = args .ContainsKeyC't") ? int .Parse (args ["t"] ) : 0; 

if (nThreads <- 0) nThreads += Environment .ProcessorCount ; 
nThreads = Math. Max (1, nThreads); 

DistanceMatrix m = new DistanceMatrixQ ; 
m.Load(inputFile) ; 

Clusters bestClusters = new Clusters (m, nClusters); 

int badLuckStreak - 0; 



int blockld = 0; 

Clusters [] clustersArray = new Clusters [nThreads] ; 

while (true) 

{ 

Parallel. For (0, nThreads, i => 
{ 

Console .WriteLineO'Attempt " + (nThreads*blockId + i + 1)); 
Clusters c - new Clusters (m, nClusters) ; 
while (true) 
{ 

Console. Wr it eLine ("Current value is " + c. Get Value () ) ; 
if (c.IterateO — 0) break; 
> 
clustersArray [i] = c; 

»; 

blockId++; 

Console .WriteLineQ ; 

for (int i = 0; i < nThreads; i++) 

■c 

Clusters c = clustersArray [i] ; 

if (c.GetValueQ < bestClusters .GetValueO) 

{. 

Console .WriteLine( "Found a better clustering") ; 

bestClusters = c; 

badLuckStreak = 0; 

> 

else badLuckStreak++; 
} 
Console .WriteLineO ; 

if (badLuckStreak >= maxBadLuckStreak) break ; 
> 

Console. Writ eLine ("Final value is " + bestClusters. GetValueO ) ; 
Console. WriteLineO ; 

// output the clustering 
if (outputFile != null) 
{. 

using (StreamWriter sw - new StreamWriter(outputFile) ) bestClusters. Dump(sw) ; 
> 

else bestClusters .Dump (Console. Out) ; 
} 

static void Main(string[] args) 

Console .Writ eLine ( "Generalized k-means , (c) 2013 Bal\x00Elzs Szalkai" ) ; 

Console. WriteLine( "Released under GNU GPL version 3"); 

Console. WriteLine("If you use the program in your research, please cite the following article:"); 

Console. WriteLine("B. Szalkai: An implementation of the relational k-means algorithm. ArXiv e-prints, 2013."); 

if (args. Length ™ 0) 
{. 



Console .WriteLine( 
Console .WriteLine( 
Console .WriteLine( 
Console .WriteLine( 
Console .WriteLine( 
Console .WriteLine( 



"Command-line arguments : " ) ; 

i <input f ile> [required] ") ; 

n <number of clusters> [required]"); 

o <output file> [optional]"); 

m <max. bad luck streak parameter> [optional, default: " + maxBadLuckStreak_Def ault + "]"); 

t <number of threads> [optional , default : 0] "+ 
"The number of concurrent threads. If less or equal to zero, then it is relative to the number of logical processors"); 
Console .WriteLine("Example : GeneralizedKMeans i input.txt n 3 o output.txt"); 
return; 
} 

Dictionary<string, string> parsedArgs = new Dictionary<string, string> () ; 

for (int i - 0; i + 1 < args. Length; i += 2) parsedArgs [args [i] .ToLowerO] = args [i + 1] ; 

if ( ! parsedArgs . Cont ainsKey ( " i" ) ) 
{ 

Console .WriteLine( "Missing argument : input file") ; 

return; 
} 

if ( ! parsedArgs. Cont ainsKeyC'n")) 
{ 

Console .WriteLine( "Missing argument : number of clusters") ; 

return; 
} 

new ProgramO .Run(parsedArgs) ; 
> 
> 
} 



5.2 Example input 

The first n lines of the input contain the names of the objects which need to be clustered. 
Then a line containing two slashes follows. After that, a matrix containing the pairwise 
distances is listed in semicolon-separated CSV format. Fractional distances must be input 
with the dot character ( . ) as decimal separator. 



object name 1 

object name 2 

object name 3 

object name 4 

object name 5 

object name 6 

object name 7 

object name 8 

object name 9 

object name 10 

// 

0;1;1 

1;0;1 

1;1;0 

1;1;1 

1;1;1 

1;1;1 

1;1;1 

1;1;1 

1;1;1 

1;1;1 
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